Comparing floating point numbers
Bruce Dawson
This article is obsolete.
Its replacement can be found by clicking on Awesome Floating Point Comparisons.
Please update your links.
I mean it. Some of the problems with this code include aliasing problems, integer overflow, and an attempt to extend the ULPs based technique further than really makes sense. The series of articles listed above covers the whole topic, but the key article that demonstrates good techniques for floating-point comparisons can be found here. That article also includes a cool demonstration, using sin(double(pi)), of why the ULPs technique and other relative error techniques break down around zero.
In short, stop reading. Click this link.
Okay, you've been warned. The remainder of this article exists purely for historical reasons.
Floating
point math is not exact. Simple values like 0.2 cannot be precisely represented
using binary floating point numbers, and the limited precision of floating
point numbers means that slight changes in the order of operations can change
the result. Different compilers and CPU architectures store temporary results
at different precisions, so results will differ depending on the details of
your environment. If you do a calculation and then compare the results against
some expected value it is highly unlikely that you will get exactly the result
you intended.
In other
words, if you do a calculation and then do this comparison:
if (result == expectedResult)
then it is unlikely that the comparison will be true. If the comparison is true then it is probably unstable – tiny changes in the input values, compiler, or CPU may change the result and make the comparison false.
Since
floating point calculations involve a bit of uncertainty we can try to allow
for this by seeing if two numbers are ‘close’ to each other. If you decide –
based on error analysis, testing, or a wild guess – that the result should
always be within 0.00001 of the expected result then you can change your
comparison to this:
if (fabs(result - expectedResult) < 0.00001)
The maximum
error value is typically called epsilon.
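Wrapped up as a function, the absolute error check might look like the sketch below; the function name is illustrative, and the epsilon must still be chosen to suit the magnitude of the values being compared.
#include <math.h>

// Minimal sketch of an absolute error comparison. The name is illustrative;
// pick maxAbsoluteError based on the range of values you expect.
bool AlmostEqualAbsolute(float result, float expectedResult,
                         float maxAbsoluteError)
{
    return fabs(result - expectedResult) < maxAbsoluteError;
}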
Absolute
error calculations have their place, but they aren’t what is
most often used. When talking about experimental error it is more common to
specify the error as a percentage. Absolute error is used less often because if
you know, say, that the error is 1.0 that tells you very little. If the result
is one million then an error of 1.0 is great. If the result is 0.1 then an
error of 1.0 is terrible.
With the
fixed precision of floating point numbers in computers there are additional
considerations with absolute error. If the absolute error is too small for the
numbers being compared then the epsilon comparison may have no effect, because
the finite precision of the floats may not be able to represent such small
differences.
Let's say
you do a calculation that has an expected answer of about 10,000. Because
floating point math is imperfect you may not get an answer of exactly 10,000 -
you may be off by one or two in the least significant bits of your result. If
you're using 4-byte floats and you're off by one in the least significant bit
of your result then instead of 10,000 you'll get +10000.000977. So we have:
float expectedResult = 10000;
float result = +10000.000977; // The closest 4-byte float to 10,000 without being 10,000
float diff = fabs(result - expectedResult);
diff is
equal to 0.000977, which is 97.7 times larger than our epsilon. So, our
comparison tells us that result and expectedResult
are not nearly equal, even though they are adjacent floats! Using an epsilon
value of 0.00001 for float calculations in this range is meaningless – it’s the same
as doing a direct comparison, just more expensive.
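A small self-contained demonstration of this (assuming 32-bit IEEE floats) might look like the following; it shows the epsilon test rejecting two adjacent floats near 10,000.
#include <math.h>
#include <stdio.h>

// Minimal demonstration (assuming 32-bit IEEE floats): an epsilon of 0.00001
// is far smaller than the gap between adjacent floats near 10,000, so the
// "almost equal" test degenerates into an exact comparison.
int main()
{
    float expectedResult = 10000.0f;
    float result = 10000.000977f;   // the adjacent float just above 10,000
    float diff = fabs(result - expectedResult);
    printf("diff = %f, epsilon = %f, equal? %s\n",
           diff, 0.00001f, diff < 0.00001f ? "yes" : "no");
    return 0;                       // prints: diff = 0.000977 ... equal? no
}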
Absolute
error comparisons have value. If the range of the expectedResult
is known then checking for absolute error is simple and effective. Just make
sure that your absolute error value is larger than the minimum representable difference
for the range and type of float you’re dealing with.
An error of
0.00001 is appropriate for numbers around one, too big for numbers around
0.00001, and too small for numbers around 10,000. A more generic way of
comparing two numbers – one that works regardless of their range – is to check the
relative error. Relative error is measured by comparing the error to the
expected result. One way of calculating it would be like this:
relativeError = fabs((result - expectedResult) / expectedResult);
If result
is 99.5, and expectedResult is 100, then the relative
error is 0.005.
Sometimes we don’t have an ‘expected’ result; we just have two
numbers that we want to compare to see if they are almost equal. We might write
a function like this:
// Non-optimal AlmostEqual function - not recommended.
bool AlmostEqualRelative(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    float relativeError = fabs((A - B) / B);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
The maxRelativeError parameter specifies what relative error we
are willing to tolerate. If we want 99.999% accuracy then we should pass a maxRelativeError of 0.00001.
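A hypothetical usage might look like this, where SomeCalculation is just a placeholder for whatever produces the result.
float SomeCalculation();    // placeholder for whatever produces the result

// Hypothetical usage of AlmostEqualRelative (defined above): accept any
// result within 99.999% of the expected value.
void CheckResult()
{
    float result = SomeCalculation();
    float expectedResult = 10000.0f;
    if (AlmostEqualRelative(result, expectedResult, 0.00001f))
    {
        // result and expectedResult differ by at most about one part
        // in 100,000
    }
}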
The initial
comparison for A == B may seem odd – if A == B then won’t relativeError
be zero? There is one case where this will not be true. If A
and B are both equal to zero then the relativeError
calculation will calculate 0.0 / 0.0. Zero divided by zero is undefined, and gives a NAN result.
The trouble with this function is that AlmostEqualRelative(x1, x2, epsilon) may not give the same result as AlmostEqualRelative(x2, x1, epsilon), because the second parameter is always used as the divisor. An improved version of AlmostEqualRelative would always divide by the larger number. This function might look like this:
// Slightly better AlmostEqual function – still not recommended
bool AlmostEqualRelative2(float A, float B, float maxRelativeError)
{
    if (A == B)
        return true;
    float relativeError;
    if (fabs(B) > fabs(A))
        relativeError = fabs((A - B) / B);
    else
        relativeError = fabs((A - B) / A);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
Even now our
function isn’t perfect. In general this function will behave poorly for numbers
around zero. The positive number closest to zero and the negative number
closest to zero are extremely close to each other, yet this function will correctly
calculate that they have a huge relative error of 2.0. If you want to count
numbers near zero but of opposite sign as being equal then you need to add a maxAbsoluteError check also. The function would then return
true if either the absoluteError or the relativeError were smaller than the maximums passed in. A
typical value for this backup maxAbsoluteError would
be very small – FLT_MIN or less, depending on whether the platform supports subnormals.
// Slightly better AlmostEqual function – still not recommended
bool AlmostEqualRelativeOrAbsolute(float A, float B,
                float maxRelativeError, float maxAbsoluteError)
{
    if (fabs(A - B) < maxAbsoluteError)
        return true;
    float relativeError;
    if (fabs(B) > fabs(A))
        relativeError = fabs((A - B) / B);
    else
        relativeError = fabs((A - B) / A);
    if (relativeError <= maxRelativeError)
        return true;
    return false;
}
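As a hypothetical usage example, here is how the absolute error backstop rescues two tiny values of opposite sign that the relative error check alone would reject.
// Hypothetical usage of AlmostEqualRelativeOrAbsolute (defined above): two
// tiny values of opposite sign have a relative error of 2.0, but the
// absolute error backstop still accepts them.
bool nearZeroEqual = AlmostEqualRelativeOrAbsolute(1e-7f, -1e-7f,
                                                   0.00001f,  // maxRelativeError
                                                   1e-6f);    // maxAbsoluteError
// nearZeroEqual is true: |1e-7f - (-1e-7f)| is 2e-7f, which is below 1e-6f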
There is an
alternate technique for checking whether two floating point numbers are close
to each other. Recall that the problem with absolute error checks is that they
don’t take into consideration whether there are any values in the range being
checked. That is, with an allowable absolute error of 0.00001 and an expectedResult of 10,000 we are saying that we will accept
any number in the range 9,999.99999 to 10,000.00001, without realizing that
when using 4-byte floats there is only one
representable float in that range – 10,000. Wouldn’t it be handy if we could
easily specify our error range in terms of how many floats we want in that
range? That is, wouldn’t it be convenient if we could say “I think the answer
is 10,000 but since floating point math is imperfect I’ll accept the 5 floats
above and the 5 floats below that value.”
It turns
out there is an easy way to do this.
The IEEE
float and double formats were designed so that the numbers are
“lexicographically ordered”, which – in the words of IEEE architect William Kahan – means “if two floating-point numbers in the same
format are ordered ( say x < y ), then they are
ordered the same way when their bits are reinterpreted as Sign-Magnitude
integers.”
This means
that if we take two floats in memory, interpret their bit pattern as integers,
and compare them, we can tell which is larger, without doing a floating point
comparison. In the C/C++ language this comparison looks like this:
if (*(int*)&f1 < *(int*)&f2)
This
charming syntax means take the address of f1, treat it as an integer pointer,
and dereference it. All those pointer operations look expensive, but they
basically all cancel out and just mean ‘treat f1 as an integer’. Since we apply
the same syntax to f2 the whole line means ‘compare f1 and f2, using their
in-memory representations interpreted as integers instead of floats’.
Kahan
says that we can compare them if we interpret them as sign-magnitude integers.
That’s unfortunate because most processors these days use twos-complement
integers. Effectively this means that the comparison only works if one or more
of the floats is positive. If both floats are negative
then the sense of the comparison is reversed – the result will be the opposite
of the equivalent float comparison. Later we will see that there is a handy
technique for dealing with this inconvenience.
Because the
floats are lexicographically ordered that means that if we increment the
representation of a float as an integer then we move to the next float. In
other words, this line of code:
(*(int*)&f1) += 1;
will
increment the underlying representation of a float and, subject to certain
restrictions, will give us the next float. For a positive number this means the
next larger float, for a negative number this means the next smaller float. In
both cases it gives us the next float farther away from zero.
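Here is a small sketch of that idea (assuming 32-bit IEEE floats and 32-bit ints); it uses memcpy rather than pointer casts to sidestep the aliasing issue discussed later.
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Minimal sketch (assuming 32-bit IEEE floats and 32-bit ints): bump the
// integer representation of a float by one to get the adjacent float.
int main()
{
    assert(sizeof(float) == sizeof(int));
    float f = 2.0f;
    int i;
    memcpy(&i, &f, sizeof(f));
    i += 1;                     // step one representation away from zero
    memcpy(&f, &i, sizeof(f));
    printf("%.8f\n", f);        // prints 2.00000024, the next float above 2.0
    return 0;
}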
We can
apply this logic in reverse also. If we subtract the integer representations of
two floats then that will tell us how close they are. If the difference is
zero, they are identical. If the difference is one, they are adjacent floats.
In general, if the difference is n then there are n-1 floats between them.
The chart
below shows some floating point numbers and the integer stored in memory that
represents them. It can be seen in this chart that the five numbers near 2.0
are represented by adjacent integers. This demonstrates the meaning of
subtracting integer representations, and also shows that there are no floats
between 1.99999988 and 2.0.
Float value  | Representation (hexadecimal) | Representation (decimal)
+1.99999976  | 0x3FFFFFFE                   | 1073741822
+1.99999988  | 0x3FFFFFFF                   | 1073741823
+2.00000000  | 0x40000000                   | 1073741824
+2.00000024  | 0x40000001                   | 1073741825
+2.00000048  | 0x40000002                   | 1073741826
With this knowledge
of the floating point format we can write this revised AlmostEqual
implementation:
// Initial AlmostEqualULPs version - fast and simple, but
// some limitations.
bool AlmostEqualUlps(float A, float B, int maxUlps)
{
    assert(sizeof(float) == sizeof(int));
    if (A == B)
        return true;
    int intDiff = abs(*(int*)&A - *(int*)&B);
    if (intDiff <= maxUlps)
        return true;
    return false;
}
It’s certainly a lot simpler, especially when you look at all the divides and calls to fabs() that it’s not doing!
The last parameter to this function is different from the
previous AlmostEqual. Instead of passing in maxRelativeError as a ratio we pass in the maximum error in
terms of Units in the Last Place (ULPs).
If two numbers are identical except for a difference of one in the last bit of their mantissa then this function will calculate intDiff as one.
If one number is the maximum number for a particular exponent – perhaps 1.99999988 – and the other number is the smallest number for the next exponent – 2.0 – then this function will again calculate intDiff as one.
In both cases the two numbers are the closest floats there are.
There is not a completely direct translation between maxRelativeError and maxUlps. For a normal float number a maxUlps of 1 is equivalent to a maxRelativeError of between 1/8,000,000 and 1/16,000,000. The variance is because the accuracy of a float varies slightly depending on whether it is near the top or bottom of its current exponent’s range. This can be seen in the chart of numbers near 2.0 – the gap between numbers just above 2.0 is twice as big as the gap between numbers just below 2.0.
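As a hypothetical usage example, the adjacent floats from the earlier 10,000 example now compare as equal.
// Hypothetical usage of AlmostEqualUlps (defined above): the adjacent floats
// from the earlier 10,000 example are one ulp apart, so they compare as
// equal even with maxUlps set to 1.
bool adjacentEqual = AlmostEqualUlps(10000.0f, 10000.000977f, 1);   // true
// The absolute epsilon test with 0.00001 rejected this same pair.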
Our AlmostEqualUlps function starts by checking whether A and B are equal – just like AlmostEqualRelative did, but for a different reason that will be discussed below.
In our last version of AlmostEqualUlps we use pointers and casting to tell the compiler to treat the in-memory representation of a float as an int. There are a couple of things that can go wrong with this. One risk is that int and float might not be the same size. A float should be 32 bits, but an int can be almost any size. This is certainly something to be aware of, but every modern compiler that I am aware of has 32-bit ints. If your compiler has ints of a different size, find a 32-bit integral type and use it instead.
Another complication comes from aliasing optimizations. Strictly speaking the C/C++ standard says that the compiler can assume that different types do not overlap in memory (with a few exceptions such as char pointers). For instance, it is allowed to assume that a pointer to an int and a pointer to a float do not point to overlapping memory. This opens up lots of worthwhile optimizations, but for code that violates this rule – which is quite common – it leads to undefined results. In particular, some versions of g++ default to very strict aliasing rules, and don’t like the techniques used in AlmostEqualUlps.
Luckily g++ knows that there will be a problem, and it gives this warning:
warning: dereferencing type-punned pointer will break strict-aliasing rules
There are two possible solutions if you encounter this problem. Turn off the strict aliasing option using the -fno-strict-aliasing switch, or use a union between a float and an int to implement the reinterpretation of a float as an int. The documentation for -fstrict-aliasing gives more information.
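A union-based version of AlmostEqualUlps might look like the sketch below. This is not the article's reference implementation; it simply shows one way the reinterpretation could be written, and it uses int32_t rather than int so the size assumption is explicit.
#include <stdint.h>
#include <stdlib.h>

// Sketch of the union approach: the union keeps the compiler from seeing a
// type-punned pointer dereference. The function name is illustrative.
bool AlmostEqualUlpsUnion(float A, float B, int32_t maxUlps)
{
    union FloatInt { float f; int32_t i; };
    if (A == B)                     // handles +0.0 compared with -0.0
        return true;
    FloatInt a, b;
    a.f = A;
    b.f = B;
    int32_t intDiff = abs(a.i - b.i);
    return intDiff <= maxUlps;
}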
Floating point math is never simple. AlmostEqualUlps doesn’t properly deal with all the peculiar types of floating point numbers. Whether it deals with them well enough depends on how you want to use it, but an improved version will often be needed.
IEEE floating point numbers fall into a few categories: normal numbers, subnormals (also known as denormals), zeroes (positive and negative), infinities, and NANs.
AlmostEqualUlps is designed to deal with normal numbers, and that is where it behaves best. Its first problem is when dealing with zeroes. IEEE floats can have both positive and negative zeroes. If you compare them as floats they are equal, but their integer representations are quite different – positive 0.0 is an integer zero, but negative zero is 0x80000000! (in decimal this is -2147483648). The chart below shows the positive and negative floats closest to zero, together with their integer representations.
Float value     | Representation (hexadecimal) | Representation (decimal)
+4.2038954e-045 | 0x00000003                   | 3
+2.8025969e-045 | 0x00000002                   | 2
+1.4012985e-045 | 0x00000001                   | 1
+0.00000000     | 0x00000000                   | 0
-0.00000000     | 0x80000000                   | -2147483648
-1.4012985e-045 | 0x80000001                   | -2147483647
-2.8025969e-045 | 0x80000002                   | -2147483646
-4.2038954e-045 | 0x80000003                   | -2147483645
In AlmostEqualUlps I deal with this by checking for A and B being exactly equal, thus handling the case where one input is positive zero and the other is negative zero. However this still isn’t perfect. With this implementation positive zero and the smallest positive subnormal will be calculated as being one ulp apart, and therefore will generally count as being equal. However negative zero and the smallest positive subnormal will be counted as being far apart, thus destroying the idea that positive and negative zero are identical.
A more general way of handling negative numbers is to adjust them so that they are lexicographically ordered as twos-complement integers instead of as signed magnitude integers. This is done by detecting negative numbers and subtracting them from 0x80000000. This maps negative zero to an integer zero representation – making it identical to positive zero – and it makes it so that the smallest negative number is represented by negative one, and downwards from there. With this change the representations of our numbers around zero look much more rational.
Remapping for twos complement

Float value     | Representation (hexadecimal) | Representation (decimal)
+4.2038954e-045 | 0x00000003                   | 3
+2.8025969e-045 | 0x00000002                   | 2
+1.4012985e-045 | 0x00000001                   | 1
+0.00000000     | 0x00000000                   | 0
-0.00000000     | 0x00000000                   | 0
-1.4012985e-045 | 0xFFFFFFFF                   | -1
-2.8025969e-045 | 0xFFFFFFFE                   | -2
-4.2038954e-045 | 0xFFFFFFFD                   | -3
Once we have made this adjustment we can no longer treat our numbers as IEEE floats – the values of the negative numbers will be dramatically altered – but we can compare them as ints more easily, in our new and convenient representation.
This technique has the additional advantage that now the distance between numbers can be measured across the zero boundary. That is, the smallest subnormal positive number and the smallest subnormal negative number will now compare as being very close – just a few ulps away. This is probably a good thing – it’s equivalent to adding an absolute error check to the relative error check. Code to implement this technique looks like this:
// Usable AlmostEqual function
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
    // Make sure maxUlps is non-negative and small enough that the
    // default NAN won't compare as equal to anything.
    assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
    int aInt = *(int*)&A;
    // Make aInt lexicographically ordered as a twos-complement int
    if (aInt < 0)
        aInt = 0x80000000 - aInt;
    // Make bInt lexicographically ordered as a twos-complement int
    int bInt = *(int*)&B;
    if (bInt < 0)
        bInt = 0x80000000 - bInt;
    int intDiff = abs(aInt - bInt);
    if (intDiff <= maxUlps)
        return true;
    return false;
}
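A couple of hypothetical calls illustrate the behavior around zero described above.
// Hypothetical calls to AlmostEqual2sComplement (defined above), assuming
// 32-bit IEEE floats.
bool zerosEqual = AlmostEqual2sComplement(+0.0f, -0.0f, 1);     // true
// After the remapping the smallest positive and negative subnormals are only
// two representations apart, so a small maxUlps accepts them as well.
bool tinyEqual = AlmostEqual2sComplement(1.4012985e-45f, -1.4012985e-45f, 4);   // true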
The next potential issue is subnormals, also known as denormals. Subnormals are numbers that are so small that they cannot be normalized. This lack of normalization means that they have less precision – the closer they get to zero, the less precision they have. This means that when comparing two subnormals, an error of one ulp can imply a significant relative error – as great as 100%. However the interpretation of the ulps error as a measure of the number of representable floats between the numbers remains. Thus, this variation in the relativeError interpretation is probably a good thing – yet another advantage to this technique of comparing floating point numbers.
IEEE floating point numbers have a special representation for infinities. These are used for overflows and for the results of division by zero. The representation for infinities is adjacent to the representation for the largest normal number. Thus, the AlmostEqualUlps routine will say that FLT_MAX and infinity are almost the same. This is reasonable in some sense – after all, there are no representable floats between them – but horribly inaccurate in another sense – after all, no finite number is ‘close’ to infinity.
If treating infinity as being ‘close’ to FLT_MAX is undesirable then an extra check is needed.
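One possible shape for that extra check is sketched below; the wrapper name is illustrative, and it simply refuses to treat a lone infinity as close to anything finite.
#include <cmath>

// Sketch of an explicit infinity check layered on AlmostEqual2sComplement.
bool AlmostEqualWithInfinityCheck(float A, float B, int maxUlps)
{
    if (std::isinf(A) || std::isinf(B))
        return A == B;      // infinities only match infinities of the same sign
    return AlmostEqual2sComplement(A, B, maxUlps);
}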
IEEE floating point numbers have a series of representations
for NANs. These representations – sharing an exponent with infinity but marked
by their non-zero mantissa – are numerically adjacent to the infinities when
compared as ints. Therefore it is possible for an infinite result, or a FLT_MAX result, to compare as being very close to a NAN.
One other complication is that comparisons involving NANs
are always supposed to return false, but AlmostEqual2sComplement will say that
two NANs are equal to each other if they have the same integer representation.
If you rely on comparisons involving NANs always returning false then you will need to add an explicit check for NANs.
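A sketch of such a check might look like this; the wrapper name is illustrative.
#include <cmath>

// Sketch of an explicit NAN check; it restores the standard rule that a
// comparison involving a NAN is always false.
bool AlmostEqualWithNanCheck(float A, float B, int maxUlps)
{
    if (std::isnan(A) || std::isnan(B))
        return false;       // a NAN never compares equal to anything
    return AlmostEqual2sComplement(A, B, maxUlps);
}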
maxUlps cannot be arbitrarily large. If maxUlps is four million or greater then there is a risk of finding large negative floats equal to NANs. If maxUlps is sixteen million or greater then the largest positive floats will compare as equal to the largest negative floats.
As a practical matter such large maxUlps values should not be needed. A maxUlps of sixteen million means that numbers 100% larger and 50% smaller should count as equal. A maxUlps of four million means that numbers 25% larger and 12.5% smaller should count as equal. If these large maxUlps values are needed then separate checking for wrap-around above infinity to NANs or numbers of the opposite sign will be needed. To prevent accidental usage of huge maxUlps values the comparison routines assert that maxUlps is in a safe range.
AlmostEqual2sComplement is very reliant on the IEEE floating point math format, and assumes twos-complement integers of the same size. These limitations are the norm on the majority of machines, especially consumer machines, but there are machines out there that use different formats. For this reason, and because the techniques used are tricky and non-obvious, it is important to encapsulate the behavior in a function where appropriate documentation, asserts, and conditional checks can be placed.
AlmostEqual2sComplement is an effective way of handling floating point comparisons. Its behavior does not map perfectly to AlmostEqualRelative, but in many ways its behavior is arguably superior. To summarize, AlmostEqual2sComplement has these characteristics:
- It measures error in ulps (the number of representable floats between the two values) rather than as a fixed relative error.
- It treats positive and negative zero as equal, and tiny numbers of opposite sign as being only a few ulps apart.
- It treats FLT_MAX as being adjacent to infinity, and NANs as being close to infinity and possibly equal to each other.
If the special characteristics of AlmostEqual2sComplement are not desirable then they can selectively be checked for. A version with the necessary checks, #ifdefed for easy control of the behavior, is available here.
AlmostEqual2sComplement works best on machines that can transfer values quickly between the floating point and integer units. This often requires going through memory and can be quite slow. This function can be implemented efficiently on machines with vector units that can do integer or floating point operations on the same registers. This allows reinterpreting the values without going through memory.
The same techniques can be applied to doubles, mapping them to 64-bit integers. Because doubles have a 53-bit mantissa a one ulp error implies a relative error of between 1/4,000,000,000,000,000 and 1/8,000,000,000,000,000.
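A sketch of a double version, under the assumptions just described (64-bit IEEE doubles, twos-complement 64-bit integers), might look like this; it uses memcpy for the reinterpretation to avoid the aliasing problems discussed earlier, and the 0x8000000000000000 remapping mirrors the float version.
#include <assert.h>
#include <stdint.h>
#include <string.h>

// Sketch of the same technique for doubles. The function name is
// illustrative; this is not the article's reference implementation.
bool AlmostEqual2sComplementDouble(double A, double B, int64_t maxUlps)
{
    assert(sizeof(double) == sizeof(int64_t));
    assert(maxUlps > 0);            // keep maxUlps small, as with floats
    int64_t aInt, bInt;
    memcpy(&aInt, &A, sizeof(A));
    memcpy(&bInt, &B, sizeof(B));
    // Remap negative values so they are lexicographically ordered as
    // twos-complement integers.
    if (aInt < 0)
        aInt = 0x8000000000000000ULL - aInt;
    if (bInt < 0)
        bInt = 0x8000000000000000ULL - bInt;
    int64_t intDiff = aInt - bInt;
    if (intDiff < 0)
        intDiff = -intDiff;
    return intDiff <= maxUlps;
}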
References:
IEEE Standard 754 Floating Point Numbers
Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, by William Kahan
Source code for compare functions and tests